Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Targeted Gene Metagenomic Data Analysis ◾ 281

com”, Silva (16S/18S rRNA) at https://www.arb-silva.de, and UNITE (fungal ITS) at

“https://unite.ut.ee/”. The database must be downloaded and then imported into QIIME2 as an artifact before it can be used for clustering. For example, we will download the latest release of curated OTUs from the GreenGenes database:

wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz

tar vxf gg_13_8_otus.tar.gz

rm gg_13_8_otus.tar.gz

Make sure that the URL is a single line with no white space. You can visit the website to

download the latest release.
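As a minimal sketch of the single-line requirement, the URL can be stored in a variable and checked for stray white space before it is passed to wget. The function name url_is_clean and the variable GG_URL are hypothetical names introduced here for illustration; the wget call is left commented out so the check can be run on its own.

```shell
# Hypothetical helper: succeeds (exit status 0) only when the
# argument contains no spaces, tabs, or newlines.
url_is_clean() {
    case "$1" in
        *[[:space:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Assumed variable name; the URL itself is the one given in the text.
GG_URL="ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz"

if url_is_clean "$GG_URL"; then
    echo "URL has no white space: $GG_URL"
    # wget "$GG_URL"    # uncomment to start the download
else
    echo "URL contains white space; rejoin it onto one line" >&2
fi
```

Quoting the variable when it is passed to wget keeps the shell from splitting it, even if a space has crept in.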

The files will be extracted into the directory “gg_13_8_otus”. Display the contents of this directory and its subdirectories using the Linux “ls” command. You will find five subdirectories: “otus” (reference OTUs), “rep_set” (reference representative sequences), “rep_set_aligned” (aligned representative sequences), “taxonomy” (taxonomy files), and “trees” (phylogenetic trees). The files in these directories contain data at different identities (e.g., 99%, 97%, and 94%). Keep this database, as we will use it for other applications.
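One way to confirm that the archive was extracted as described is to loop over the expected subdirectory names and report which ones are present. This is only a sketch: check_gg_layout is a hypothetical helper, and it assumes the database was extracted into “gg_13_8_otus” in the current working directory.

```shell
# Hypothetical helper: report which of the expected subdirectories
# exist under the given database directory. The exit status is the
# number of missing subdirectories (0 means the layout is complete).
check_gg_layout() {
    db_dir=$1
    missing=0
    for sub in otus rep_set rep_set_aligned taxonomy trees; do
        if [ -d "$db_dir/$sub" ]; then
            echo "found:   $db_dir/$sub"
        else
            echo "missing: $db_dir/$sub"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Assumes the archive was extracted in the current directory.
check_gg_layout gg_13_8_otus || echo "some subdirectories are missing"
```

Running “ls gg_13_8_otus/rep_set” afterwards lists the FASTA files available at each identity.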

To use the reference database for clustering, you need to import the FASTA file of the database representative sequences. First, choose the identity at which you wish to perform clustering. For example, to cluster your sample sequences at 97% identity, import “rep_set/97_otus.fasta” into QIIME2 as an artifact using “tools import”. To keep the files organized, we will create the subdirectory “closed_ref_cl_97” for closed-reference clustering files.

mkdir closed_ref_cl_97

Then, import the database representative sequences into a QIIME2 artifact.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path gg_13_8_otus/rep_set/97_otus.fasta \
  --output-path inputs/97_otus-GG_db.qza
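Because the identity appears in both the input and output file names, the same import can be sketched with the identity held in a variable, so that switching to 99% or 94% requires a single change. The variable names IDENTITY, REF_FASTA, and REF_QZA are assumptions made for this example, and the “mkdir -p inputs” line assumes the “inputs” directory may not exist yet. The command is skipped when qiime or the FASTA file is not available.

```shell
# Assumed variable names, introduced for illustration only.
IDENTITY=97
REF_FASTA="gg_13_8_otus/rep_set/${IDENTITY}_otus.fasta"
REF_QZA="inputs/${IDENTITY}_otus-GG_db.qza"

# Run the import only when QIIME2 is on the PATH and the FASTA file exists.
if command -v qiime >/dev/null 2>&1 && [ -f "$REF_FASTA" ]; then
    mkdir -p inputs    # harmless if the directory already exists
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path "$REF_FASTA" \
        --output-path "$REF_QZA"
else
    echo "Skipping import: qiime or $REF_FASTA not found"
fi
```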

Then, you can use the “cluster-features-closed-reference” method of the “q2-vsearch” plugin to perform closed-reference clustering on the features generated in the dereplication steps. The input artifacts are the dereplicated feature table “derep-yoga-table.qza”, the dereplicated representative sequences “derep-yoga-seqs.qza”, and the reference representative sequences from the database “97_otus-GG_db.qza”.
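Before running the clustering command, it can help to confirm that the three input artifacts are where the command expects them. The following is a sketch only: require_files is a hypothetical helper, and the paths assume the artifacts were saved in the “inputs” directory, as in the commands in this section.

```shell
# Hypothetical helper: print each file that is missing and return
# a non-zero exit status if at least one file is absent.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing input artifact: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Paths assume the artifacts are stored under "inputs".
if require_files \
    inputs/derep-yoga-table.qza \
    inputs/derep-yoga-seqs.qza \
    inputs/97_otus-GG_db.qza
then
    echo "all input artifacts found"
else
    echo "create the missing artifacts before clustering"
fi
```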

qiime vsearch cluster-features-closed-reference \

--i-table inputs/derep-yoga-table.qza \

--i-sequences inputs/derep-yoga-seqs.qza \